In this project, Airbnb listing data for the Netherlands is taken from Kaggle. Three types of algorithms are applied to this dataset: regression, classification and clustering.
The following models are implemented:
First, the required packages are installed.
install.packages("e1071")
install.packages("caTools")
install.packages("corrplot")
install.packages("devtools")
install.packages("dendextend")
install.packages("tree")
install.packages("zoo")
install.packages("scales")
install.packages("ggmap")
install.packages("stringr")
install.packages("gridExtra")
install.packages("caret")
install.packages("treemap")
install.packages("psych")
install.packages("DAAG")
install.packages("leaps")
install.packages("glmnet")
install.packages("boot")
install.packages("naniar")
install.packages("tidyr")
install.packages("DT")
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyverse")
install.packages("kableExtra")
install.packages("lubridate")
install.packages("readxl")
install.packages("highcharter")
install.packages("RColorBrewer")
install.packages("wesanderson")
install.packages("plotly")
install.packages("shiny")
install.packages("readr")
install.packages("choroplethr")
install.packages("choroplethrMaps")
install.packages("GGally")
install.packages("ade4")
install.packages("data.table")
After installing the packages, the next step is to load them so that they can be used throughout the analysis.
The dataset is read from a CSV file and stored in a variable named AirBnb. The first six observations are displayed to give a brief look at the attributes and a few records.
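The loading step itself is not shown in this report, so the following is only a minimal sketch of how it could look. The package list is an assumption based on the functions called later (for example `mutate`, `ggplot`, `gg_miss_var`, `plot_ly`, `corrplot`, `regsubsets`, `cv.glm` and `tree`), and the variable names `pkgs`, `installed` and `missing_pkgs` are hypothetical.

```r
# Packages the later code appears to rely on (an assumption; adjust as needed)
pkgs <- c("dplyr", "ggplot2", "naniar", "plotly", "corrplot", "leaps", "boot", "tree")

# Check which packages are installed before trying to attach them
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach every package that is available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Checking availability first avoids a hard stop at the first missing package and lists everything that still needs to be installed.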
AirBnb = read.csv("AirBNB.csv")
head(AirBnb)
## host_id host_name host_since_year host_since_anniversary id
## 1 1662 Chloe 2008 08-Nov 304958
## 2 3159 Daniel 2008 Sep-24 2818
## 3 3718 Britta 2008 Oct-19 103026
## 4 4716 Stefan 2008 Nov-30 550017
## 5 5271 Tyler 2008 Dec-17 4728389
## 6 5271 Tyler 2008 Dec-17 5500954
## neighbourhood_cleansed city state zipcode
## 1 Westerpark Amsterdam North Holland 1053
## 2 Oostelijk Havengebied - Indische Buurt Amsterdam North Holland
## 3 De Baarsjes - Oud-West Amsterdam Noord-Holland 1053
## 4 Centrum-Oost Amsterdam North Holland 1017
## 5 Centrum-West Amsterdam Noord-Holland 1016 AM
## 6 Centrum-West Amsterdam NH 1016 AM
## country property_type room_type accommodates bathrooms bedrooms
## 1 Netherlands Apartment Entire home/apt 4 2 2
## 2 Netherlands Apartment Private room 2 1 1
## 3 Netherlands Apartment Entire home/apt 4 1 1
## 4 Netherlands Apartment Entire home/apt 2 1 1
## 5 Netherlands Apartment Entire home/apt 6 1 2
## 6 Netherlands Apartment Private room 4 1 1
## beds bed_type price guests_included extra_people minimum_nights
## 1 2 Real Bed 130 4 10 4
## 2 2 Real Bed 59 1 10 3
## 3 1 Real Bed 95 2 25 3
## 4 1 Real Bed 100 1 10 2
## 5 2 Real Bed 250 2 25 2
## 6 1 Real Bed 140 2 25 2
## host_response_time host_response_rate number_of_reviews review_scores_rating
## 1 within a day 0.8 11 98
## 2 within an hour 1 108 97
## 3 within a few hours 1 15 92
## 4 within a day 1 20 97
## 5 within a day 0.89 1 100
## 6 within a day 0.9 0 NA
## review_scores_accuracy review_scores_cleanliness review_scores_checkin
## 1 10 10 9
## 2 10 10 10
## 3 9 9 10
## 4 10 10 10
## 5 8 10 8
## 6 NA NA NA
## review_scores_communication review_scores_location review_scores_value
## 1 10 10 10
## 2 10 9 10
## 3 10 9 9
## 4 10 10 10
## 5 10 10 6
## 6 NA NA NA
dim(AirBnb)
## [1] 7833 31
Our dataset has 31 features and 7833 observations.
Next we check the names of all features, as they need to be clear enough to understand.
colnames(AirBnb)
## [1] "host_id" "host_name"
## [3] "host_since_year" "host_since_anniversary"
## [5] "id" "neighbourhood_cleansed"
## [7] "city" "state"
## [9] "zipcode" "country"
## [11] "property_type" "room_type"
## [13] "accommodates" "bathrooms"
## [15] "bedrooms" "beds"
## [17] "bed_type" "price"
## [19] "guests_included" "extra_people"
## [21] "minimum_nights" "host_response_time"
## [23] "host_response_rate" "number_of_reviews"
## [25] "review_scores_rating" "review_scores_accuracy"
## [27] "review_scores_cleanliness" "review_scores_checkin"
## [29] "review_scores_communication" "review_scores_location"
## [31] "review_scores_value"
In our case all the names are easy to read and understand.
As the column names suggest, some dimensions hold categorical data. When the data is loaded, these are read as the character data type. To work with such variables, their data type needs to be converted to factor.
AirBnb <- as.data.frame(unclass(AirBnb), stringsAsFactors = TRUE)
str(AirBnb)
## 'data.frame': 7833 obs. of 31 variables:
## $ host_id : int 1662 3159 3718 4716 5271 5271 5271 5988 9616 14589 ...
## $ host_name : Factor w/ 2987 levels "(email hidden)",..: 439 522 348 2644 2806 2806 2806 2343 1576 2486 ...
## $ host_since_year : int 2008 2008 2008 2008 2008 2008 2008 2009 2009 2009 ...
## $ host_since_anniversary : Factor w/ 366 levels "01-Apr","01-Aug",..: 94 360 336 329 186 186 186 1 36 155 ...
## $ id : int 304958 2818 103026 550017 4728389 5500954 5181918 2774924 23651 738245 ...
## $ neighbourhood_cleansed : Factor w/ 22 levels "Bijlmer-Centrum",..: 21 15 8 5 6 6 6 22 9 6 ...
## $ city : Factor w/ 35 levels "Ã\201msterdam",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ state : Factor w/ 23 levels "","Amsterdam",..: 19 19 15 19 15 12 15 19 19 19 ...
## $ zipcode : Factor w/ 3276 levels ""," ","....",..: 1429 1 1429 683 551 551 551 2243 2641 382 ...
## $ country : Factor w/ 1 level "Netherlands": 1 1 1 1 1 1 1 1 1 1 ...
## $ property_type : Factor w/ 15 levels "Apartment","Bed & Breakfast",..: 1 1 1 1 1 1 1 9 1 9 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 2 1 1 1 2 2 2 2 1 ...
## $ accommodates : int 4 2 4 2 6 4 2 2 3 2 ...
## $ bathrooms : num 2 1 1 1 1 1 1 1 1 1 ...
## $ bedrooms : int 2 1 1 1 2 1 1 1 1 1 ...
## $ beds : int 2 2 1 1 2 1 1 1 1 1 ...
## $ bed_type : Factor w/ 5 levels "Airbed","Couch",..: 5 5 5 5 5 5 3 5 5 5 ...
## $ price : int 130 59 95 100 250 140 115 80 80 90 ...
## $ guests_included : int 4 1 2 1 2 2 1 1 2 1 ...
## $ extra_people : int 10 10 25 10 25 25 0 0 15 0 ...
## $ minimum_nights : int 4 3 3 2 2 2 1 3 6 3 ...
## $ host_response_time : Factor w/ 5 levels "a few days or more",..: 3 5 4 3 3 3 3 5 3 2 ...
## $ host_response_rate : Factor w/ 86 levels "0.02","0.05",..: 65 85 85 85 74 75 74 85 85 86 ...
## $ number_of_reviews : int 11 108 15 20 1 0 4 33 36 8 ...
## $ review_scores_rating : int 98 97 92 97 100 NA 95 95 96 93 ...
## $ review_scores_accuracy : int 10 10 9 10 8 NA 9 9 9 10 ...
## $ review_scores_cleanliness : int 10 10 9 10 10 NA 9 10 10 9 ...
## $ review_scores_checkin : int 9 10 10 10 8 NA 9 10 10 9 ...
## $ review_scores_communication: int 10 10 10 10 10 NA 10 10 10 9 ...
## $ review_scores_location : int 10 9 9 10 10 NA 10 10 9 10 ...
## $ review_scores_value : int 10 10 9 10 6 NA 9 9 9 9 ...
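The conversion idiom used above can be tried on a small example. This is a sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the dataset); it only illustrates that character columns become factors while numeric columns are left alone.

```r
# Toy data frame (made-up values) with one character and one numeric column
toy <- data.frame(
  room_type = c("Private room", "Entire home/apt", "Private room"),
  price = c(59, 130, 80),
  stringsAsFactors = FALSE
)

# unclass() turns the data frame into a plain list; rebuilding it with
# stringsAsFactors = TRUE converts every character column to a factor
toy_fct <- as.data.frame(unclass(toy), stringsAsFactors = TRUE)

str(toy_fct)
levels(toy_fct$room_type)
```

Note that this converts every character column, including ones that are really numbers stored as text, which is what happened to host_response_rate in the output above (a factor with 86 levels).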
Now we check the number of missing values and which variables contain them.
sum(is.na(AirBnb))
## [1] 12051
summary(is.na(AirBnb))
## host_id host_name host_since_year host_since_anniversary
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:7833 FALSE:7833 FALSE:7833 FALSE:7833
##
## id neighbourhood_cleansed city state
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:7833 FALSE:7833 FALSE:7833 FALSE:7833
##
## zipcode country property_type room_type
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:7833 FALSE:7833 FALSE:7833 FALSE:7833
##
## accommodates bathrooms bedrooms beds
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:7833 FALSE:7764 FALSE:7819 FALSE:7820
## TRUE :69 TRUE :14 TRUE :13
## bed_type price guests_included extra_people
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:7833 FALSE:7833 FALSE:7833 FALSE:7833
##
## minimum_nights host_response_time host_response_rate number_of_reviews
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:7833 FALSE:7833 FALSE:7833 FALSE:7833
##
## review_scores_rating review_scores_accuracy review_scores_cleanliness
## Mode :logical Mode :logical Mode :logical
## FALSE:6135 FALSE:6124 FALSE:6124
## TRUE :1698 TRUE :1709 TRUE :1709
## review_scores_checkin review_scores_communication review_scores_location
## Mode :logical Mode :logical Mode :logical
## FALSE:6125 FALSE:6122 FALSE:6124
## TRUE :1708 TRUE :1711 TRUE :1709
## review_scores_value
## Mode :logical
## FALSE:6122
## TRUE :1711
The total number of NAs is 12051. The summary shows that the dimensions containing NAs are bathrooms, bedrooms, beds and the review score columns.
Here we visualize the number of missing values in each attribute.
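A more compact way to get the same per-column counts is `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`toy` is hypothetical) rather than the real data.

```r
# Toy data frame (made-up values) with a few missing entries
toy <- data.frame(
  bathrooms = c(1, NA, 2, 1),
  beds = c(2, 1, NA, NA),
  price = c(59, 130, 80, 95)
)

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))

# Keep only the columns that actually contain NAs
na_per_column[na_per_column > 0]

# Overall count, equivalent to sum(is.na(toy))
total_na <- sum(na_per_column)
total_na
```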
gg_miss_var(AirBnb)
The heat plot clearly shows which features contain missing values and the overall percentage of missing and present values.
vis_miss(AirBnb) + theme(axis.text.x = element_text(angle = 90))
## Handling NA’s and Null values
The missing values in the dataset now need to be handled. Our dataset contains two types of data: numeric and non-numeric. For numeric features, each NA is replaced by the mean of the observed values of that feature. As there are no NAs in the non-numeric variables, we leave them as they are.
AirBnb <- AirBnb %>%
mutate(review_scores_rating = ifelse(is.na(review_scores_rating), mean(review_scores_rating,na.rm=TRUE),review_scores_rating),
bedrooms = ifelse(is.na(bedrooms), mean(bedrooms,na.rm=TRUE),bedrooms), beds = ifelse(is.na(beds), mean(beds,na.rm=TRUE),beds),
bathrooms = ifelse(is.na(bathrooms), mean(bathrooms,na.rm=TRUE),bathrooms))
AirBnb <- AirBnb %>%
mutate(review_scores_accuracy = ifelse(is.na(review_scores_accuracy), mean(review_scores_accuracy,na.rm=TRUE),review_scores_accuracy),
review_scores_cleanliness = ifelse(is.na(review_scores_cleanliness), mean(review_scores_cleanliness,na.rm=TRUE),review_scores_cleanliness),
review_scores_checkin = ifelse(is.na(review_scores_checkin), mean(review_scores_checkin,na.rm=TRUE),review_scores_checkin),
review_scores_communication = ifelse(is.na(review_scores_communication), mean(review_scores_communication,na.rm=TRUE),review_scores_communication),
review_scores_location = ifelse(is.na(review_scores_location), mean(review_scores_location,na.rm=TRUE),review_scores_location),
review_scores_value = ifelse(is.na(review_scores_value), mean(review_scores_value,na.rm=TRUE),review_scores_value))
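The repeated `ifelse(is.na(x), mean(x, na.rm = TRUE), x)` pattern can also be written once as a helper and applied to all numeric columns. This is a sketch under that assumption, on made-up data; `impute_mean` and `toy` are hypothetical names, not part of the original analysis.

```r
# Replace the NA entries of a numeric vector with the mean of its observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data frame (made-up values)
toy <- data.frame(beds = c(1, 2, NA, 3), price = c(59, 130, 80, 95))

# Apply the helper to every numeric column
numeric_cols <- vapply(toy, is.numeric, logical(1))
toy[numeric_cols] <- lapply(toy[numeric_cols], impute_mean)

toy$beds
```

Mean imputation turns integer columns into doubles whenever the mean is not a whole number, which is why bedrooms and beds appear as `<dbl>` later in the report.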
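# Note: host_response_rate and host_response_time are factors at this point.
# ifelse() drops the factor attributes and returns the integer level codes, not the
# original values, and mean() of a factor would return NA. This is why both columns
# appear as integers (1 to 86 and 1 to 5) in the summary further down.
AirBnb <- AirBnb %>%
  mutate(host_response_rate = ifelse(host_response_rate == "NA", mean(host_response_rate), host_response_rate),
         host_response_time = ifelse(host_response_time == "NA", NA, host_response_time))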
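One thing to watch in the step above is that `ifelse()` applied to a factor returns the integer level codes. The sketch below shows the effect on made-up values and one possible alternative that converts through character first. The vector `rate` and the `"N/A"` marker are hypothetical; the report does not show how missing response rates are encoded in the file.

```r
# Made-up response rates stored as a factor; "N/A" stands in for a missing marker
rate <- factor(c("0.8", "1", "0.89", "N/A"))

# ifelse() drops the factor attributes, so the result holds level codes
codes <- ifelse(rate == "N/A", NA, rate)
codes

# Converting through character keeps the real values; "N/A" becomes NA
rate_num <- suppressWarnings(as.numeric(as.character(rate)))

# Mean imputation can then be applied to the numeric vector
rate_num[is.na(rate_num)] <- mean(rate_num, na.rm = TRUE)
rate_num
```

With this approach the column keeps values between 0 and 1 instead of level codes.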
Here we visualize the dataset again after handling the missing values.
gg_miss_var(AirBnb)
vis_miss(AirBnb) + theme(axis.text.x = element_text(angle = 90))
This confirms that all the data is now present and there are no missing values anymore.
summary(AirBnb)
## host_id host_name host_since_year host_since_anniversary
## Min. : 1662 Douwe&Niki : 91 Min. :2008 Jun-19 : 118
## 1st Qu.: 3430410 Jorrit&Dirk: 72 1st Qu.:2012 02-May : 95
## Median : 7392601 Myra : 59 Median :2013 Aug-21 : 90
## Mean : 9879849 Peter : 50 Mean :2013 12-Feb : 51
## 3rd Qu.:15054166 Michiel : 49 3rd Qu.:2014 10-Sep : 49
## Max. :30595041 Anne : 43 Max. :2015 Aug-31 : 45
## (Other) :7469 (Other):7385
## id neighbourhood_cleansed city
## Min. : 2818 Centrum-West :1426 Amsterdam :7702
## 1st Qu.:1309364 De Baarsjes - Oud-West :1203 Amsterdam-Zuidoost: 35
## Median :2964891 Centrum-Oost : 920 Diemen : 14
## Mean :2926936 De Pijp - Rivierenbuurt: 906 Jordaan : 13
## 3rd Qu.:4473450 Westerpark : 689 Watergraafsmeer : 9
## Max. :5897527 Zuid : 579 Ã\201msterdam : 7
## (Other) :2110 (Other) : 53
## state zipcode country property_type
## North Holland:5761 1054 : 209 Netherlands:7833 Apartment :6280
## Noord-Holland:1877 1015 : 181 House : 711
## NH : 159 1017 : 176 Bed & Breakfast: 370
## : 8 : 173 Boat : 327
## Noord Holland: 5 1053 : 155 Loft : 77
## Amsterdam : 3 1013 : 149 Other : 29
## (Other) : 20 (Other):6790 (Other) : 39
## room_type accommodates bathrooms bedrooms
## Entire home/apt:6305 Min. : 1.000 Min. :0.000 Min. : 0.000
## Private room :1482 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 1.000
## Shared room : 46 Median : 2.000 Median :1.000 Median : 1.000
## Mean : 3.115 Mean :1.113 Mean : 1.415
## 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.: 2.000
## Max. :16.000 Max. :8.000 Max. :10.000
##
## beds bed_type price guests_included
## Min. : 1.000 Airbed : 13 Min. : 15 Min. : 0.000
## 1st Qu.: 1.000 Couch : 11 1st Qu.: 85 1st Qu.: 1.000
## Median : 1.000 Futon : 26 Median : 109 Median : 1.000
## Mean : 1.984 Pull-out Sofa: 94 Mean : 129 Mean : 1.642
## 3rd Qu.: 2.000 Real Bed :7689 3rd Qu.: 150 3rd Qu.: 2.000
## Max. :16.000 Max. :9000 Max. :16.000
##
## extra_people minimum_nights host_response_time host_response_rate
## Min. : 0.00 Min. : 1.000 Min. :1.000 Min. : 1.00
## 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.:3.000 1st Qu.:75.00
## Median : 0.00 Median : 2.000 Median :4.000 Median :85.00
## Mean : 13.62 Mean : 2.509 Mean :3.756 Mean :76.83
## 3rd Qu.: 25.00 3rd Qu.: 3.000 3rd Qu.:5.000 3rd Qu.:85.00
## Max. :235.00 Max. :27.000 Max. :5.000 Max. :86.00
##
## number_of_reviews review_scores_rating review_scores_accuracy
## Min. : 0.00 Min. : 20.00 Min. : 2.000
## 1st Qu.: 1.00 1st Qu.: 92.00 1st Qu.: 9.000
## Median : 5.00 Median : 93.34 Median : 9.447
## Mean : 13.83 Mean : 93.34 Mean : 9.447
## 3rd Qu.: 15.00 3rd Qu.: 98.00 3rd Qu.:10.000
## Max. :297.00 Max. :100.00 Max. :10.000
##
## review_scores_cleanliness review_scores_checkin review_scores_communication
## Min. : 2.00 Min. : 2.000 Min. : 2.000
## 1st Qu.: 9.00 1st Qu.: 9.639 1st Qu.: 9.698
## Median : 9.29 Median :10.000 Median :10.000
## Mean : 9.29 Mean : 9.639 Mean : 9.698
## 3rd Qu.:10.00 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.00 Max. :10.000 Max. :10.000
##
## review_scores_location review_scores_value
## Min. : 2.000 Min. : 2.00
## 1st Qu.: 9.000 1st Qu.: 9.00
## Median : 9.293 Median : 9.00
## Mean : 9.293 Mean : 9.04
## 3rd Qu.:10.000 3rd Qu.: 9.04
## Max. :10.000 Max. :10.00
##
Glimpse gives a compact view of the data: the number of dimensions, the number of observations, the data types and all the column names.
glimpse(AirBnb)
## Rows: 7,833
## Columns: 31
## $ host_id <int> 1662, 3159, 3718, 4716, 5271, 5271, 5271, ~
## $ host_name <fct> "Chloe", "Daniel", "Britta", "Stefan", "Ty~
## $ host_since_year <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, ~
## $ host_since_anniversary <fct> 08-Nov, Sep-24, Oct-19, Nov-30, Dec-17, De~
## $ id <int> 304958, 2818, 103026, 550017, 4728389, 550~
## $ neighbourhood_cleansed <fct> Westerpark, Oostelijk Havengebied - Indisc~
## $ city <fct> "Amsterdam", "Amsterdam", "Amsterdam", "Am~
## $ state <fct> North Holland, North Holland, Noord-Hollan~
## $ zipcode <fct> 1053, , 1053, 1017, 1016 AM, 1016 AM, 1016~
## $ country <fct> Netherlands, Netherlands, Netherlands, Net~
## $ property_type <fct> Apartment, Apartment, Apartment, Apartment~
## $ room_type <fct> Entire home/apt, Private room, Entire home~
## $ accommodates <int> 4, 2, 4, 2, 6, 4, 2, 2, 3, 2, 3, 3, 2, 3, ~
## $ bathrooms <dbl> 2.000000, 1.000000, 1.000000, 1.000000, 1.~
## $ bedrooms <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, ~
## $ beds <dbl> 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, ~
## $ bed_type <fct> Real Bed, Real Bed, Real Bed, Real Bed, Re~
## $ price <int> 130, 59, 95, 100, 250, 140, 115, 80, 80, 9~
## $ guests_included <int> 4, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, ~
## $ extra_people <int> 10, 10, 25, 10, 25, 25, 0, 0, 15, 0, 30, 0~
## $ minimum_nights <int> 4, 3, 3, 2, 2, 2, 1, 3, 6, 3, 7, 3, 3, 4, ~
## $ host_response_time <int> 3, 5, 4, 3, 3, 3, 3, 5, 3, 2, 4, 3, 5, 3, ~
## $ host_response_rate <int> 65, 85, 85, 85, 74, 75, 74, 85, 85, 86, 85~
## $ number_of_reviews <int> 11, 108, 15, 20, 1, 0, 4, 33, 36, 8, 3, 2,~
## $ review_scores_rating <dbl> 98.0000, 97.0000, 92.0000, 97.0000, 100.00~
## $ review_scores_accuracy <dbl> 10.00000, 10.00000, 9.00000, 10.00000, 8.0~
## $ review_scores_cleanliness <dbl> 10.000000, 10.000000, 9.000000, 10.000000,~
## $ review_scores_checkin <dbl> 9.000000, 10.000000, 10.000000, 10.000000,~
## $ review_scores_communication <dbl> 10.000000, 10.000000, 10.000000, 10.000000~
## $ review_scores_location <dbl> 10.000000, 9.000000, 9.000000, 10.000000, ~
## $ review_scores_value <dbl> 10.000000, 10.000000, 9.000000, 10.000000,~
This pie chart shows the property types among the listings along with their percentages.
property_type_d <- data.frame(table(AirBnb$property_type))
property_type_data <- property_type_d[,c('Var1', 'Freq')]
fig <- plot_ly(property_type_data, labels = ~Var1, values = ~Freq, type = 'pie')
fig
# Group neighbourhood_cleansed variable with room_type.
property_df <- AirBnb %>%
group_by(neighbourhood_cleansed, room_type) %>%
summarize(Freq = n())
# Filtering room_type and grouping it with particular neighbourhood_cleansed
total_property <- AirBnb %>%
filter(room_type %in% c("Private room","Entire home/apt","Shared room")) %>%
group_by(neighbourhood_cleansed) %>%
summarize(sum = n())
# Merging both variables in order to visualize and plot
property_ratio <- merge (property_df, total_property, by="neighbourhood_cleansed")
property_ratio <- property_ratio %>%
mutate(ratio = Freq/sum)
# Plot listings present in each neighbourhood group
ggplot(property_ratio, aes(x = neighbourhood_cleansed, y = ratio, fill = room_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  xlab("Neighbourhood Cleansed") + ylab("Share of Listings") + # y is a ratio, not a count
  scale_fill_discrete(name = "Room Type") + # the fill variable is room_type
  scale_y_continuous(labels = scales::percent) +
  coord_flip()
The graph above shows the share of each room type within each neighbourhood. ‘Shared room’ listings are rare in every neighbourhood. Overall, ‘Entire home/apt’ is by far the most common room type (6305 of 7833 listings), followed by ‘Private room’ (1482).
AirBnb %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(mean_price = mean(price, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(neighbourhood_cleansed, mean_price), y = mean_price)) +
  geom_col(color = "black", fill = "maroon") + # geom_col() already uses stat = "identity"
  coord_flip() +
  theme_gray() +
  geom_text(aes(label = round(mean_price, digits = 2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Mean Price comparison for each Neighbourhood Group", subtitle = "Price vs Neighbourhood Group") +
  xlab("Neighbourhood Group") +
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 1),
        plot.subtitle = element_text(color = "black", hjust = 0.5),
        axis.ticks = element_blank())
AirBnb %>%
  filter(!(is.na(room_type))) %>%
  filter(!(room_type == "Unknown")) %>%
  group_by(room_type) %>%
  summarise(mean_price = mean(price, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(room_type, mean_price), y = mean_price)) +
  geom_col(color = "black", fill = "orange") + # geom_col() already uses stat = "identity"
  coord_flip() +
  theme_gray() +
  geom_text(aes(label = round(mean_price, digits = 2)), hjust = 2.0, color = "black", size = 3.5) +
  ggtitle("Mean Price comparison with all Room Types", subtitle = "Price vs Room Type") +
  xlab("Room Type") +
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "black", hjust = 0.5),
        axis.ticks = element_blank())
A correlation plot is made to examine the relationships among the numeric features.
airbnb.corr <- AirBnb %>%
select(price, minimum_nights, accommodates, bathrooms, bedrooms, beds, guests_included, extra_people)
cor(airbnb.corr) # get the correlation matrix
## price minimum_nights accommodates bathrooms bedrooms
## price 1.00000000 0.01903058 0.34302041 0.22020643 0.34534540
## minimum_nights 0.01903058 1.00000000 0.01783162 0.03515667 0.08472229
## accommodates 0.34302041 0.01783162 1.00000000 0.44742126 0.70468281
## bathrooms 0.22020643 0.03515667 0.44742126 1.00000000 0.43230198
## bedrooms 0.34534540 0.08472229 0.70468281 0.43230198 1.00000000
## beds 0.31670780 0.04521712 0.82401499 0.46935595 0.70831706
## guests_included 0.23804450 0.03152692 0.51068034 0.23791043 0.43861282
## extra_people 0.11928948 -0.04788970 0.32452077 0.12239996 0.19291047
## beds guests_included extra_people
## price 0.31670780 0.23804450 0.1192895
## minimum_nights 0.04521712 0.03152692 -0.0478897
## accommodates 0.82401499 0.51068034 0.3245208
## bathrooms 0.46935595 0.23791043 0.1224000
## bedrooms 0.70831706 0.43861282 0.1929105
## beds 1.00000000 0.45365983 0.2282370
## guests_included 0.45365983 1.00000000 0.4400605
## extra_people 0.22823700 0.44006047 1.0000000
corrplot(cor(airbnb.corr), method = "number", type = "lower", bg = "grey") # draw the correlation matrix as a plot
We are going to implement two regression models. The first is simple linear regression and the second is multiple regression.
In the simple linear regression model, the variable to be predicted is price and the predictor is accommodates. The question is: does the number of guests a listing accommodates have an impact on price?
A Price vs Accommodates graph is drawn to visualize the trend between the two features.
ggplot(data = AirBnb, mapping = aes(x = accommodates, y = price)) +
geom_jitter() # jitter instead of points, otherwise many dots get drawn over each other
After visualizing the points, a least-squares regression line is drawn through them, that is, the line that minimizes the sum of squared residuals. Note that this plot uses the natural logarithm of price on the y-axis.
ggplot(data = AirBnb, mapping = aes(x = accommodates, y = log(price))) +
  geom_jitter() + # jitter instead of points, otherwise many dots get drawn over each other
  stat_summary(fun = mean, colour = "green", size = 4, geom = "point", shape = 23, fill = "green") + # mean per x value
  stat_smooth(method = "lm", se = FALSE) # regression line
## `geom_smooth()` using formula 'y ~ x'
We create a linear model. The first argument is the model which takes the form of dependent variable ~ independent variable(s). The second argument is the data we should consider.
linearmodel <- lm(price ~ accommodates, data = AirBnb)
Plot the linear model to visualize its diagnostic statistics.
par(mfrow=c(2,2))
plot(linearmodel)
The summary of the linear model shows parameters such as the p-value, R-squared and adjusted R-squared.
summary(linearmodel) # ask for a summary of this linear model
##
## Call:
## lm(formula = price ~ accommodates, data = AirBnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -416.0 -32.2 -11.1 22.9 8898.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.1792 2.7654 18.51 <2e-16 ***
## accommodates 24.9890 0.7733 32.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 120.3 on 7831 degrees of freedom
## Multiple R-squared: 0.1177, Adjusted R-squared: 0.1176
## F-statistic: 1044 on 1 and 7831 DF, p-value: < 2.2e-16
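To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below plugs the intercept and slope reported in the summary above into price = intercept + slope * accommodates; the guest counts in `guests` are arbitrary illustrative values.

```r
# Coefficients copied from the summary output above
intercept <- 51.1792
slope <- 24.9890

# Illustrative numbers of guests (arbitrary choices)
guests <- c(2, 4, 6)

# Predicted price from the simple linear model
predicted_price <- intercept + slope * guests
data.frame(accommodates = guests, predicted_price = predicted_price)
```

Each additional guest adds about 25 to the predicted price. The R-squared of 0.1177 means that accommodates alone explains roughly 12% of the variation in price.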
To remove outliers, extreme prices are dropped at both ends: only listings priced strictly between the 10th and 90th percentiles are kept.
AirBnb_filtered_data <- AirBnb %>%
filter(price < quantile(AirBnb$price, 0.9) & price > quantile(AirBnb$price, 0.1))
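The same trimming logic can be checked on a small vector in base R. This is a sketch with made-up prices (`price`, `lower`, `upper` and `trimmed` are hypothetical names); it shows that the strict inequalities drop the extreme values at both ends.

```r
# Made-up prices, including one very low and one very high value
price <- c(15, 59, 80, 95, 100, 115, 130, 140, 250, 9000)

# 10th and 90th percentiles
lower <- quantile(price, 0.1)
upper <- quantile(price, 0.9)

# Keep only the values strictly between the two bounds
trimmed <- price[price > lower & price < upper]
trimmed
```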
The training data is stored in “training_data” and the testing data in “testing_data”. The data is split in the ratio 50:50.
set.seed(12345)
AirBnb_filtered_data <- AirBnb_filtered_data %>% mutate(id = row_number())
training_data <- AirBnb_filtered_data %>% sample_frac(.5) %>% filter(price > 0)
testing_data <- anti_join(AirBnb_filtered_data, training_data, by = 'id') %>% filter(price > 0)
As a sanity check on the split, the number of training rows plus the number of testing rows should equal the number of rows in the filtered data.
nrow(training_data) + nrow(testing_data) == nrow(AirBnb_filtered_data %>% filter(price > 0))
## [1] TRUE
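A split like this can also be done with row indices in base R. The following is a sketch on a made-up data frame (`toy`, `train_idx`, `training` and `testing` are hypothetical names), with a second check that no row ends up in both halves.

```r
set.seed(12345)

# Toy data frame (made-up values) with a row identifier
n <- 10
toy <- data.frame(id = seq_len(n),
                  price = c(59, 80, 95, 100, 115, 130, 140, 150, 160, 170))

# Draw half of the row indices at random for training
train_idx <- sample(n, size = n / 2)
training <- toy[train_idx, ]
testing <- toy[-train_idx, ]

# Sanity checks: the halves cover all rows and do not overlap
rows_add_up <- nrow(training) + nrow(testing) == nrow(toy)
no_overlap <- length(intersect(training$id, testing$id)) == 0
c(rows_add_up = rows_add_up, no_overlap = no_overlap)
```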
A variable selection method is used to choose appropriate variables for the model. Here I used the best subset regression method.
best_fit_model <- regsubsets (price ~neighbourhood_cleansed + minimum_nights + accommodates + bathrooms + bedrooms + beds + guests_included + extra_people + property_type + room_type + number_of_reviews, data = training_data, nbest = 2, nvmax = 11)
summary(best_fit_model)
plot(best_fit_model, scale="bic")
Based on the output of the variable selection method, the model below uses neighbourhood_cleansed, minimum_nights, room_type, accommodates, beds, bedrooms, bathrooms, extra_people and number_of_reviews.
Now a model is created with the variables selected by the method.
Linear_Model<-lm(price ~ neighbourhood_cleansed + minimum_nights + accommodates + bathrooms + bedrooms + beds + extra_people + room_type +number_of_reviews, data = training_data)
Linear_Model_Summary <- summary(Linear_Model)
Linear_Model_MSE <- Linear_Model_Summary$sigma^2
Linear_Model_RSQ <- Linear_Model_Summary$r.squared
Linear_Model_ARSQ <- Linear_Model_Summary$adj.r.squared
Linear_Model_Summary
##
## Call:
## lm(formula = price ~ neighbourhood_cleansed + minimum_nights +
## accommodates + bathrooms + bedrooms + beds + extra_people +
## room_type + number_of_reviews, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -201.014 -19.030 -4.214 16.249 90.056
##
## Coefficients:
## Estimate
## (Intercept) 74.59828
## neighbourhood_cleansedBijlmer-Oost -5.17820
## neighbourhood_cleansedBos en Lommer -3.54984
## neighbourhood_cleansedBuitenveldert - Zuidas 0.76130
## neighbourhood_cleansedCentrum-Oost 22.95496
## neighbourhood_cleansedCentrum-West 29.69401
## neighbourhood_cleansedDe Aker - Nieuw Sloten -3.39186
## neighbourhood_cleansedDe Baarsjes - Oud-West 9.98905
## neighbourhood_cleansedDe Pijp - Rivierenbuurt 10.41505
## neighbourhood_cleansedGaasperdam - Driemond 3.66337
## neighbourhood_cleansedGeuzenveld - Slotermeer -6.84094
## neighbourhood_cleansedIJburg - Zeeburgereiland 2.46950
## neighbourhood_cleansedNoord-Oost -10.80096
## neighbourhood_cleansedNoord-West 3.88900
## neighbourhood_cleansedOostelijk Havengebied - Indische Buurt -3.17246
## neighbourhood_cleansedOsdorp -5.97466
## neighbourhood_cleansedOud-Noord 2.69579
## neighbourhood_cleansedOud-Oost 3.16678
## neighbourhood_cleansedSlotervaart 1.71980
## neighbourhood_cleansedWatergraafsmeer 7.63674
## neighbourhood_cleansedWesterpark 7.19021
## neighbourhood_cleansedZuid 10.12776
## minimum_nights -0.60984
## accommodates 3.73188
## bathrooms 3.54347
## bedrooms 12.69799
## beds 0.87705
## extra_people 0.01790
## room_typePrivate room -17.23587
## room_typeShared room -26.53055
## number_of_reviews -0.14796
## Std. Error t value
## (Intercept) 9.75751 7.645
## neighbourhood_cleansedBijlmer-Oost 21.37608 -0.242
## neighbourhood_cleansedBos en Lommer 9.88859 -0.359
## neighbourhood_cleansedBuitenveldert - Zuidas 10.81780 0.070
## neighbourhood_cleansedCentrum-Oost 9.72979 2.359
## neighbourhood_cleansedCentrum-West 9.69183 3.064
## neighbourhood_cleansedDe Aker - Nieuw Sloten 12.14670 -0.279
## neighbourhood_cleansedDe Baarsjes - Oud-West 9.69919 1.030
## neighbourhood_cleansedDe Pijp - Rivierenbuurt 9.72624 1.071
## neighbourhood_cleansedGaasperdam - Driemond 21.35263 0.172
## neighbourhood_cleansedGeuzenveld - Slotermeer 13.15579 -0.520
## neighbourhood_cleansedIJburg - Zeeburgereiland 10.88402 0.227
## neighbourhood_cleansedNoord-Oost 14.00540 -0.771
## neighbourhood_cleansedNoord-West 11.63768 0.334
## neighbourhood_cleansedOostelijk Havengebied - Indische Buurt 9.90415 -0.320
## neighbourhood_cleansedOsdorp 13.18808 -0.453
## neighbourhood_cleansedOud-Noord 10.12277 0.266
## neighbourhood_cleansedOud-Oost 9.84387 0.322
## neighbourhood_cleansedSlotervaart 10.40983 0.165
## neighbourhood_cleansedWatergraafsmeer 10.20604 0.748
## neighbourhood_cleansedWesterpark 9.76404 0.736
## neighbourhood_cleansedZuid 9.81053 1.032
## minimum_nights 0.28263 -2.158
## accommodates 0.67443 5.533
## bathrooms 1.68401 2.104
## bedrooms 0.98920 12.837
## beds 0.71194 1.232
## extra_people 0.02968 0.603
## room_typePrivate room 1.50652 -11.441
## room_typeShared room 9.71258 -2.732
## number_of_reviews 0.01993 -7.425
## Pr(>|t|)
## (Intercept) 2.80e-14 ***
## neighbourhood_cleansedBijlmer-Oost 0.80861
## neighbourhood_cleansedBos en Lommer 0.71963
## neighbourhood_cleansedBuitenveldert - Zuidas 0.94390
## neighbourhood_cleansedCentrum-Oost 0.01838 *
## neighbourhood_cleansedCentrum-West 0.00220 **
## neighbourhood_cleansedDe Aker - Nieuw Sloten 0.78008
## neighbourhood_cleansedDe Baarsjes - Oud-West 0.30315
## neighbourhood_cleansedDe Pijp - Rivierenbuurt 0.28434
## neighbourhood_cleansedGaasperdam - Driemond 0.86379
## neighbourhood_cleansedGeuzenveld - Slotermeer 0.60311
## neighbourhood_cleansedIJburg - Zeeburgereiland 0.82052
## neighbourhood_cleansedNoord-Oost 0.44065
## neighbourhood_cleansedNoord-West 0.73827
## neighbourhood_cleansedOostelijk Havengebied - Indische Buurt 0.74875
## neighbourhood_cleansedOsdorp 0.65056
## neighbourhood_cleansedOud-Noord 0.79002
## neighbourhood_cleansedOud-Oost 0.74770
## neighbourhood_cleansedSlotervaart 0.86879
## neighbourhood_cleansedWatergraafsmeer 0.45436
## neighbourhood_cleansedWesterpark 0.46155
## neighbourhood_cleansedZuid 0.30200
## minimum_nights 0.03103 *
## accommodates 3.41e-08 ***
## bathrooms 0.03545 *
## bedrooms < 2e-16 ***
## beds 0.21808
## extra_people 0.54645
## room_typePrivate room < 2e-16 ***
## room_typeShared room 0.00634 **
## number_of_reviews 1.46e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.93 on 2973 degrees of freedom
## Multiple R-squared: 0.3016, Adjusted R-squared: 0.2946
## F-statistic: 42.8 on 30 and 2973 DF, p-value: < 2.2e-16
The MSE, R-squared and adjusted R-squared of the model are, respectively:
Linear_Model_MSE
## [1] 725.4168
Linear_Model_RSQ
## [1] 0.3016399
Linear_Model_ARSQ
## [1] 0.2945928
par(mfrow=c(2,2))
plot(Linear_Model)
The residuals vs fitted plot shows that the points are not evenly distributed around zero and do not have constant variance across the fitted values. This means that the linearity and equal variance assumptions are not satisfied.
The Q-Q plot follows the 45-degree line, which indicates that the normality assumption is met.
Linear_Model_Test <- predict(object = Linear_Model, newdata = testing_data)
Now calculating MSE for Test Data
mean((Linear_Model_Test - testing_data$price)^2)
## [1] 739.5444
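The test MSE is the average squared difference between predicted and actual prices. The sketch below works through the formula on four made-up values (`actual` and `predicted` are hypothetical) and also takes the square root to express the error in price units.

```r
# Made-up actual and predicted prices
actual <- c(100, 120, 90, 150)
predicted <- c(110, 115, 95, 140)

# Mean squared error: average of the squared prediction errors
test_mse <- mean((predicted - actual)^2)

# Root mean squared error, in the same units as price
test_rmse <- sqrt(test_mse)

c(mse = test_mse, rmse = test_rmse)
```

Applied to the value above, the square root of 739.5444 is roughly 27, so a typical prediction error on the test data is about 27 price units.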
Calculating the mean squared prediction error (MSPE) for the filtered data set using 3-fold cross-validation
Linear_Model_FD <- glm(price ~ neighbourhood_cleansed + minimum_nights + accommodates + bathrooms + bedrooms + beds + extra_people + room_type + number_of_reviews, data = AirBnb_filtered_data)
cv.glm(data = AirBnb_filtered_data, glmfit = Linear_Model_FD, K = 3)$delta[2] # delta[2] is the adjusted cross-validation estimate
## [1] 734.8276
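To show what K-fold cross-validation does, here is a manual 3-fold version in base R. It is a sketch on simulated data, not a reimplementation of `cv.glm`: the data frame `toy`, its coefficients and the noise level are made up, and the result corresponds to a raw cross-validation estimate without the adjustment applied to `delta[2]`.

```r
set.seed(1)

# Simulated data (made-up relationship between accommodates and price)
n <- 60
toy <- data.frame(accommodates = sample(1:6, n, replace = TRUE))
toy$price <- 50 + 25 * toy$accommodates + rnorm(n, sd = 10)

# Assign every row to one of k folds of equal size, in random order
k <- 3
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, predict the held-out fold, record its MSE
fold_mse <- vapply(seq_len(k), function(i) {
  fit <- lm(price ~ accommodates, data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  mean((toy$price[fold == i] - pred)^2)
}, numeric(1))

# Cross-validated MSE: average over the folds
cv_mse <- mean(fold_mse)
cv_mse
```

Every observation is used for testing exactly once, so the estimate depends less on one particular train/test split.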
The cross-validated MSE on the filtered data is about 735 and the MSE on the test data is about 740, so the two are very close. This suggests that the model performs consistently on unseen data, although with an R-squared of about 0.30 the selected variables explain only part of the variation in price.
Based on the property characteristics and other parameters, is the price of a particular property high?
Loading the tree package
require(tree)
## Loading required package: tree
## Warning: package 'tree' was built under R version 4.1.2
## Registered S3 method overwritten by 'tree':
## method from
## print.tree cli
A classification algorithm needs a discrete target variable. In our case the target is derived from price: we create a new variable named price_cat that categorizes price into “Cheap” and “Expensive”. The split is based on the mean price: if a price is greater than the mean it is assigned to the Expensive category, otherwise (less than or equal to the mean) it is assigned to the Cheap category.
The new feature price_cat is attached to the original data and its data type is changed to factor for further processing.
AirBnb_filtered_data_cat <- AirBnb %>%
mutate(price_cat = ifelse(price <= mean(price),"Cheap","Expensive"))
AirBnb_filtered_data_cat = data.frame(AirBnb_filtered_data_cat, AirBnb_filtered_data_cat$price_cat)
AirBnb_filtered_data_cat$price_cat = as.factor(AirBnb_filtered_data_cat$price_cat)
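A minimal base R sketch of the same labelling step on toy data follows. The data frame listings and its prices are hypothetical. The sketch adds the label once; the data.frame() call above also appends a second copy of the label, which shows up later as the column AirBnb_filtered_data_cat.price_cat in the Naive Bayes output.

```r
# Hypothetical toy data; only the price column matters here.
listings <- data.frame(price = c(50, 80, 120, 200, 95))

# Label each row relative to the mean price (109 for this toy data).
listings$price_cat <- factor(
  ifelse(listings$price <= mean(listings$price), "Cheap", "Expensive"),
  levels = c("Cheap", "Expensive")
)

table(listings$price_cat)
names(listings)   # "price" and "price_cat" only, no duplicate copy of the label
```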
We drop the variables that are not important, including price: price cannot be used as a predictor because our response variable price_cat is created from it.
Afterwards, we fit the model on AirBnb_filtered_data_cat with price_cat as the target variable.
AirBnb_filtered_data_cat = select(AirBnb_filtered_data_cat, -c(price,host_id,host_name,host_since_year,host_since_anniversary,id,city,country,state,zipcode))
tree.AirBnb_filtered_data_cat = tree(price_cat~., data = AirBnb_filtered_data_cat)
The summary shows the variables used, the number of terminal nodes, the residual mean deviance, and the misclassification error rate.
summary(tree.AirBnb_filtered_data_cat)
##
## Classification tree:
## tree(formula = price_cat ~ ., data = AirBnb_filtered_data_cat)
## Variables actually used in tree construction:
## [1] "accommodates" "neighbourhood_cleansed" "room_type"
## [4] "bedrooms" "guests_included" "extra_people"
## [7] "review_scores_location"
## Number of terminal nodes: 8
## Residual mean deviance: 0.9146 = 7157 / 7825
## Misclassification error rate: 0.2054 = 1609 / 7833
Now plot the tree to visualize it.
plot(tree.AirBnb_filtered_data_cat)
text(tree.AirBnb_filtered_data_cat, pretty = 0)
tree.AirBnb_filtered_data_cat
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 7833 10070.0 Cheap ( 0.65760 0.34240 )
## 2) accommodates < 3.5 4929 4597.0 Cheap ( 0.82329 0.17671 )
## 4) neighbourhood_cleansed: Bijlmer-Centrum,Bijlmer-Oost,Bos en Lommer,Buitenveldert - Zuidas,De Aker - Nieuw Sloten,De Baarsjes - Oud-West,De Pijp - Rivierenbuurt,Gaasperdam - Driemond,Geuzenveld - Slotermeer,IJburg - Zeeburgereiland,Noord-Oost,Noord-West,Oostelijk Havengebied - Indische Buurt,Osdorp,Oud-Noord,Oud-Oost,Slotervaart,Watergraafsmeer,Westerpark,Zuid 3563 2457.0 Cheap ( 0.89082 0.10918 )
## 8) room_type: Private room,Shared room 1073 166.3 Cheap ( 0.98509 0.01491 ) *
## 9) room_type: Entire home/apt 2490 2103.0 Cheap ( 0.85020 0.14980 ) *
## 5) neighbourhood_cleansed: Centrum-Oost,Centrum-West 1366 1774.0 Cheap ( 0.64714 0.35286 ) *
## 3) accommodates > 3.5 2904 3846.0 Expensive ( 0.37638 0.62362 )
## 6) bedrooms < 1.20744 784 1028.0 Cheap ( 0.63648 0.36352 ) *
## 7) bedrooms > 1.20744 2120 2515.0 Expensive ( 0.28019 0.71981 )
## 14) guests_included < 3.5 1517 1964.0 Expensive ( 0.35003 0.64997 )
## 28) extra_people < 2.5 697 649.6 Expensive ( 0.17647 0.82353 ) *
## 29) extra_people > 2.5 820 1137.0 Expensive ( 0.49756 0.50244 )
## 58) review_scores_location < 9.14647 347 422.1 Cheap ( 0.70317 0.29683 ) *
## 59) review_scores_location > 9.14647 473 610.5 Expensive ( 0.34672 0.65328 ) *
## 15) guests_included > 3.5 603 403.8 Expensive ( 0.10448 0.89552 ) *
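Each line shows the node number, the split rule with its threshold, the number of observations, the deviance, the predicted class, and the class probabilities (Cheap, Expensive). Terminal nodes are marked with an asterisk.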
Now we split the data into a training set of 5,000 randomly sampled rows (about 64% of the 7,833 listings) and a test set of the remaining 2,833 rows, and refit the tree on the training set.
set.seed(100)
train = sample(1:nrow(AirBnb_filtered_data_cat), 5000)
tree.AirBnb = tree(price_cat~., AirBnb_filtered_data_cat, subset = train)
Plot the tree model fitted with training dataset.
plot(tree.AirBnb)
text(tree.AirBnb, pretty = 0)
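The next step is to predict on the held-out rows to check how well the model performs, and then evaluate the error using a misclassification table. Note that the call below predicts with tree.AirBnb_filtered_data_cat, the tree fitted on the full dataset, rather than tree.AirBnb, so the held-out rows were also used for fitting and the resulting accuracy is likely optimistic.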
tree.pred = predict(tree.AirBnb_filtered_data_cat, AirBnb_filtered_data_cat[-train,], type="class")
with(AirBnb_filtered_data_cat[-train,], table(tree.pred, price_cat))
## price_cat
## tree.pred Cheap Expensive
## Cheap 1726 466
## Expensive 133 508
On diagonal are the correct classifications while off the diagonal are incorrect classifications.
(1726 + 508)/2833
## [1] 0.7885634
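The arithmetic above can be written as a small sketch that derives both the accuracy and the error rate from the confusion matrix. The counts are copied from the table above; the object names cm, accuracy and error_rate are illustrative.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual).
cm <- matrix(c(1726, 133, 466, 508), nrow = 2,
             dimnames = list(predicted = c("Cheap", "Expensive"),
                             actual    = c("Cheap", "Expensive")))

accuracy   <- sum(diag(cm)) / sum(cm)   # correct classifications / all test rows
error_rate <- 1 - accuracy              # misclassification error rate

round(c(accuracy = accuracy, error_rate = error_rate), 4)
```

The diagonal holds the 2,234 correct predictions out of 2,833 test rows, so the value 0.7886 is an accuracy and the corresponding error rate is about 0.2114.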
About 79% of the held-out listings are classified correctly, so the misclassification error rate is about 0.21.
When developing a large, bushy tree, there may be too much variation. As a result, let’s utilise cross-validation to prune the tree as efficiently as possible. Use the misclassification error rate as the foundation for pruning using cv.tree.
cv.AirBnb_filtered_data_cat = cv.tree(tree.AirBnb_filtered_data_cat, FUN = prune.misclass)
cv.AirBnb_filtered_data_cat
## $size
## [1] 8 6 3 2 1
##
## $dev
## [1] 1728 1728 1777 1981 2616
##
## $k
## [1] -Inf 0 47 214 718
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.AirBnb_filtered_data_cat)
The plot shows the cross-validated misclassification count ($dev) falling as the tree grows, from 2616 at size 1 to 1728 at sizes 6 and 8, where it levels off. We choose the value at the bottom of this downward staircase, 8, and prune the tree to that size. Let’s plot and annotate the tree to see how it turns out.
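As a hedged sketch of how a tree size could be picked programmatically from the cv.tree output, the snippet below copies the $size and $dev vectors printed above and selects the smallest size that attains the minimum cross-validated error. The tie-breaking rule (prefer the smaller tree) is an assumption made for illustration, not the rule used in this analysis.

```r
# Values copied from the cv.tree output above.
size <- c(8, 6, 3, 2, 1)
dev  <- c(1728, 1728, 1777, 1981, 2616)   # cross-validated misclassification counts

best_dev   <- min(dev)
candidates <- size[dev == best_dev]       # all sizes that reach the minimum error
best_size  <- min(candidates)             # assumed tie-break: prefer the simpler tree

candidates
best_size
```

Under this rule the result is 6 rather than 8, since both sizes have the same cross-validated error; passing it as best = best_size to prune.misclass would give a tree with fewer terminal nodes.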
prune.AirBnb_filtered_data_cat = prune.misclass(tree.AirBnb_filtered_data_cat, best = 8)
plot(prune.AirBnb_filtered_data_cat)
text(prune.AirBnb_filtered_data_cat, pretty=0)
Because the original tree already has 8 terminal nodes, pruning to size 8 returns the same tree. Let’s evaluate it on the test dataset again.
tree.pred = predict(prune.AirBnb_filtered_data_cat, AirBnb_filtered_data_cat[-train,], type="class")
with(AirBnb_filtered_data_cat[-train,], table(tree.pred, price_cat))
## price_cat
## tree.pred Cheap Expensive
## Cheap 1726 466
## Expensive 133 508
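The confusion matrix is identical to that of the original tree, so pruning to size 8 did not change the misclassification error. A smaller size (size 6 has the same cross-validated error in the output above) would be needed to obtain a simpler tree.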
Splitting data into train and test data
split <- sample.split(AirBnb_filtered_data_cat, SplitRatio = 0.7)
train_cl <- subset(AirBnb_filtered_data_cat, split == "TRUE")
test_cl <- subset(AirBnb_filtered_data_cat, split == "FALSE")
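As we understand the caTools documentation, sample.split() expects the vector of labels (for example AirBnb_filtered_data_cat$price_cat) rather than a whole data frame, and it draws the split at random, so the seed should be set before it is called. The base R sketch below illustrates the same idea, a reproducible 70:30 split stratified by class, on hypothetical toy data; the names toy, train_toy and test_toy are illustrative.

```r
set.seed(12345)   # set the seed before sampling so the split is reproducible

# Hypothetical labels: 66 cheap and 34 expensive listings.
labels <- factor(rep(c("Cheap", "Expensive"), times = c(66, 34)))
toy    <- data.frame(id = seq_along(labels), price_cat = labels)

# Sample 70% of the row indices within each class.
train_idx <- unlist(
  lapply(split(seq_len(nrow(toy)), toy$price_cat),
         function(idx) idx[sample.int(length(idx), round(0.7 * length(idx)))]),
  use.names = FALSE
)

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

table(train_toy$price_cat)
table(test_toy$price_cat)
```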
Fitting Naive Bayes Model to training dataset
set.seed(12345) # Setting Seed
classifier_cl <- naiveBayes(price_cat ~ ., data = train_cl)
classifier_cl
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Cheap Expensive
## 0.6594495 0.3405505
##
## Conditional probabilities:
## neighbourhood_cleansed
## Y Bijlmer-Centrum Bijlmer-Oost Bos en Lommer Buitenveldert - Zuidas
## Cheap 0.004173623 0.003617140 0.057039510 0.011964385
## Expensive 0.000000000 0.000000000 0.017241379 0.004849138
## neighbourhood_cleansed
## Y Centrum-Oost Centrum-West De Aker - Nieuw Sloten
## Cheap 0.089593767 0.119922092 0.006121313
## Expensive 0.155711207 0.305495690 0.004849138
## neighbourhood_cleansed
## Y De Baarsjes - Oud-West De Pijp - Rivierenbuurt
## Cheap 0.169170840 0.124930440
## Expensive 0.113685345 0.108297414
## neighbourhood_cleansed
## Y Gaasperdam - Driemond Geuzenveld - Slotermeer
## Cheap 0.001669449 0.010851419
## Expensive 0.000000000 0.001616379
## neighbourhood_cleansed
## Y IJburg - Zeeburgereiland Noord-Oost Noord-West
## Cheap 0.011407902 0.007234279 0.011129661
## Expensive 0.016702586 0.003232759 0.005926724
## neighbourhood_cleansed
## Y Oostelijk Havengebied - Indische Buurt Osdorp Oud-Noord
## Cheap 0.050639955 0.005008347 0.031441291
## Expensive 0.028556034 0.004310345 0.019935345
## neighbourhood_cleansed
## Y Oud-Oost Slotervaart Watergraafsmeer Westerpark Zuid
## Cheap 0.065108514 0.024485253 0.023094046 0.097941013 0.073455760
## Expensive 0.030711207 0.009159483 0.019935345 0.075431034 0.074353448
##
## property_type
## Y Apartment Bed & Breakfast Boat Cabin Camper/RV
## Cheap 0.8336115748 0.0614913745 0.0217028381 0.0016694491 0.0022259321
## Expensive 0.7532327586 0.0226293103 0.0759698276 0.0005387931 0.0000000000
## property_type
## Y Chalet Dorm Earth House House Hut
## Cheap 0.0000000000 0.0002782415 0.0002782415 0.0653867557 0.0000000000
## Expensive 0.0000000000 0.0000000000 0.0000000000 0.1325431034 0.0000000000
## property_type
## Y Loft Other Treehouse Villa Yurt
## Cheap 0.0080690039 0.0041736227 0.0002782415 0.0002782415 0.0005564830
## Expensive 0.0102370690 0.0026939655 0.0000000000 0.0021551724 0.0000000000
##
## room_type
## Y Entire home/apt Private room Shared room
## Cheap 0.720367279 0.273233166 0.006399555
## Expensive 0.958512931 0.039870690 0.001616379
##
## accommodates
## Y [,1] [,2]
## Cheap 2.578186 1.220081
## Expensive 4.112608 2.113168
##
## bathrooms
## Y [,1] [,2]
## Cheap 1.052898 0.3134883
## Expensive 1.223556 0.4617200
##
## bedrooms
## Y [,1] [,2]
## Cheap 1.138422 0.5289212
## Expensive 1.931075 1.1183099
##
## beds
## Y [,1] [,2]
## Cheap 1.537253 1.080841
## Expensive 2.814620 2.125744
##
## bed_type
## Y Airbed Couch Futon Pull-out Sofa Real Bed
## Cheap 0.0022259321 0.0016694491 0.0044518642 0.0155815248 0.9760712298
## Expensive 0.0005387931 0.0000000000 0.0010775862 0.0032327586 0.9951508621
##
## guests_included
## Y [,1] [,2]
## Cheap 1.405398 0.7598203
## Expensive 2.123922 1.5491060
##
## extra_people
## Y [,1] [,2]
## Cheap 11.25042 15.88109
## Expensive 18.11530 22.53947
##
## minimum_nights
## Y [,1] [,2]
## Cheap 2.429883 1.935851
## Expensive 2.617996 1.719424
##
## host_response_time
## Y [,1] [,2]
## Cheap 3.732888 1.0553263
## Expensive 3.808190 0.9734246
##
## host_response_rate
## Y [,1] [,2]
## Cheap 76.45659 15.10461
## Expensive 77.40733 13.56613
##
## number_of_reviews
## Y [,1] [,2]
## Cheap 15.63077 27.93547
## Expensive 10.10938 17.85257
##
## review_scores_rating
## Y [,1] [,2]
## Cheap 92.95944 6.715819
## Expensive 93.88201 7.003579
##
## review_scores_accuracy
## Y [,1] [,2]
## Cheap 9.423933 0.7194781
## Expensive 9.472038 0.7551881
##
## review_scores_cleanliness
## Y [,1] [,2]
## Cheap 9.257155 0.8861725
## Expensive 9.322351 0.8559437
##
## review_scores_checkin
## Y [,1] [,2]
## Cheap 9.634670 0.6332843
## Expensive 9.644268 0.7129074
##
## review_scores_communication
## Y [,1] [,2]
## Cheap 9.692464 0.5823568
## Expensive 9.710545 0.5672628
##
## review_scores_location
## Y [,1] [,2]
## Cheap 9.206621 0.7844875
## Expensive 9.438555 0.6902923
##
## review_scores_value
## Y [,1] [,2]
## Cheap 9.025899 0.7930470
## Expensive 9.056944 0.7862621
##
## AirBnb_filtered_data_cat.price_cat
## Y Cheap Expensive
## Cheap 1 0
## Expensive 0 1
Predicting on test data
y_pred <- predict(classifier_cl, newdata = test_cl)
Confusion Matrix
cm <- table(test_cl$price_cat, y_pred)
cm
## y_pred
## Cheap Expensive
## Cheap 1518 39
## Expensive 11 815
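Model Evaluation
Note that the predictors include AirBnb_filtered_data_cat.price_cat, an exact copy of the response (its conditional probabilities above are all 0 or 1), so the accuracy reported below is inflated by target leakage.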
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## Cheap Expensive
## Cheap 1518 39
## Expensive 11 815
##
## Accuracy : 0.979
## 95% CI : (0.9724, 0.9844)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.954
##
## Mcnemar's Test P-Value : 0.0001343
##
## Sensitivity : 0.9928
## Specificity : 0.9543
## Pos Pred Value : 0.9750
## Neg Pred Value : 0.9867
## Prevalence : 0.6416
## Detection Rate : 0.6370
## Detection Prevalence : 0.6534
## Balanced Accuracy : 0.9736
##
## 'Positive' Class : Cheap
##
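The sketch below recomputes the main metrics by hand from the counts above. The table was built as table(test_cl$price_cat, y_pred), so its rows are the actual classes and its columns are the predictions. To our knowledge, caret's confusionMatrix() reads a table with predictions in rows and the reference in columns, so the class-wise labels in the output above may be swapped relative to this orientation; the accuracy is unaffected. The object names are illustrative.

```r
# Counts copied from the output above (rows: actual class, columns: predicted class).
cm <- matrix(c(1518, 11, 39, 815), nrow = 2,
             dimnames = list(actual    = c("Cheap", "Expensive"),
                             predicted = c("Cheap", "Expensive")))

accuracy <- sum(diag(cm)) / sum(cm)

# Metrics for the positive class "Cheap".
recall_cheap    <- cm["Cheap", "Cheap"] / sum(cm["Cheap", ])    # share of actual Cheap found
precision_cheap <- cm["Cheap", "Cheap"] / sum(cm[, "Cheap"])    # share of predicted Cheap that is right

round(c(accuracy = accuracy, recall = recall_cheap, precision = precision_cheap), 4)
```

With this orientation the recall for Cheap is 1518 / 1557 and the precision is 1518 / 1529, which correspond to the values listed above as Pos Pred Value and Sensitivity, respectively.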
airbnb.pca <- prcomp(AirBnb[,c(13:16,18:31)], center = TRUE, scale. = TRUE)
summary(airbnb.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9555 1.8606 1.25032 1.05551 0.98530 0.96323 0.92902
## Proportion of Variance 0.2124 0.1923 0.08685 0.06189 0.05393 0.05154 0.04795
## Cumulative Proportion 0.2124 0.4048 0.49163 0.55352 0.60746 0.65900 0.70695
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.89733 0.84511 0.79855 0.7348 0.70240 0.68436 0.6641
## Proportion of Variance 0.04473 0.03968 0.03543 0.0300 0.02741 0.02602 0.0245
## Cumulative Proportion 0.75169 0.79136 0.82679 0.8568 0.88420 0.91022 0.9347
## PC15 PC16 PC17 PC18
## Standard deviation 0.62137 0.56134 0.55417 0.40825
## Proportion of Variance 0.02145 0.01751 0.01706 0.00926
## Cumulative Proportion 0.95617 0.97368 0.99074 1.00000
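To show where the "Proportion of Variance" and "Cumulative Proportion" rows come from, the sketch below recomputes them from the component standard deviations. It uses the built-in USArrests data as a stand-in, and the 80% threshold is an arbitrary choice for illustration.

```r
# Illustrative only: USArrests stands in for the numeric AirBnb columns.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
cum_var       <- cumsum(var_explained)          # cumulative proportion

# Smallest number of components whose cumulative proportion reaches 80% (assumed threshold).
n_keep <- which(cum_var >= 0.80)[1]

round(rbind(var_explained, cum_var), 4)
n_keep
```

Applied to the summary above, the same rule would keep the first 10 components, since the cumulative proportion first exceeds 0.80 at PC10 (0.82679).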
Dropping variables that will not be used in this model
main_data = select(AirBnb, -c(host_id,host_name,host_since_year,host_since_anniversary,id,city,state,zipcode,review_scores_accuracy, review_scores_cleanliness,review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value))
dim(main_data)
## [1] 7833 17
Creating new independent data variables for model
data_new_var <- main_data %>%
mutate(bathroom_luxury = ifelse(bathrooms>0, accommodates/bathrooms,0),privacy = ifelse(bedrooms>0, beds/bedrooms,0))
Remove columns that will not be useful for clustering like price and country
clustering_data <- subset(data_new_var, select=-c(price,country))
Normalizing Function
normalize <- function(x){
return ((x - min(x))/(max(x) - min(x)))
}
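A short worked example of this min-max normalization on a hypothetical vector follows. It also shows an edge case that is easy to overlook: a constant column yields NaN, because the denominator max(x) - min(x) is zero.

```r
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

normalize(c(2, 4, 6, 10))   # 0.00 0.25 0.50 1.00
normalize(c(3, 3, 3))       # NaN NaN NaN, since max and min are equal
```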
Normalizing Variables before analysis
names(clustering_data)
## [1] "neighbourhood_cleansed" "property_type" "room_type"
## [4] "accommodates" "bathrooms" "bedrooms"
## [7] "beds" "bed_type" "guests_included"
## [10] "extra_people" "minimum_nights" "host_response_time"
## [13] "host_response_rate" "number_of_reviews" "review_scores_rating"
## [16] "bathroom_luxury" "privacy"
sapply(clustering_data, class)
## neighbourhood_cleansed property_type room_type
## "factor" "factor" "factor"
## accommodates bathrooms bedrooms
## "integer" "numeric" "numeric"
## beds bed_type guests_included
## "numeric" "factor" "integer"
## extra_people minimum_nights host_response_time
## "integer" "integer" "integer"
## host_response_rate number_of_reviews review_scores_rating
## "integer" "integer" "numeric"
## bathroom_luxury privacy
## "numeric" "numeric"
clustering_data_norm = mutate(clustering_data, accom = normalize(accommodates), baths = normalize(bathrooms),
reviews_count = normalize(number_of_reviews), review_rating = normalize(review_scores_rating), bedroom_count=normalize(bedrooms),
bed_count=normalize(beds), bathrom_lux = normalize(bathroom_luxury), privacy=normalize(privacy))
clustering_data_norm1 = as.data.frame(clustering_data_norm)
clustering_data_norm2 = clustering_data_norm1 %>%
cbind(acm.disjonctif(clustering_data_norm1[,c("bed_type","property_type","room_type","neighbourhood_cleansed","host_response_time")]))%>%ungroup()
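acm.disjonctif() expands each categorical column into one indicator (0/1) column per level, named variable.level as seen in the cluster centers further down. A minimal base R sketch of the same kind of encoding, using model.matrix() on a hypothetical factor, is:

```r
# Hypothetical factor with three levels.
room_type <- factor(c("Entire home/apt", "Private room", "Shared room", "Private room"))

# One indicator column per level; "- 1" drops the intercept so no level is left out.
dummies <- model.matrix(~ room_type - 1)
colnames(dummies) <- paste0("room_type.", levels(room_type))

dummies
rowSums(dummies)   # each row has exactly one indicator set to 1
```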
Remove the variables that are coded.
clustering_data_norm3 = clustering_data_norm2 %>%
select(-property_type,-room_type,-bed_type,-neighbourhood_cleansed,-host_response_time)
Remove columns that were created for factor levels that were not represented in the sample.
clustering_data_norm4 <- clustering_data_norm3[, colSums(clustering_data_norm3!=0, na.rm =TRUE)>0]
Now run K-means and look at the within SSE Curve
SSE_curve <- c()
sum(is.na(clustering_data_norm4))
## [1] 0
for(n in 1:15){
kcluster <- kmeans((clustering_data_norm4),n)
sse <- sum(kcluster$withinss)
SSE_curve[n] <- sse
}
SSE_curve
## [1] 10023675 6694217 4829406 3812307 3382144 2617061 2346743 2249841
## [9] 2030716 1941090 1772356 1745831 1505836 1533079 1425214
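K-means starts from random centers, which is why the curve above is not strictly decreasing (the SSE rises from 1505836 at k = 13 to 1533079 at k = 14). The sketch below shows one way to stabilise the elbow curve: fix the seed and let kmeans() try several random starts through nstart. The iris measurements are a stand-in for the clustering data, and nstart = 10 is an arbitrary illustrative choice.

```r
set.seed(42)                      # make the random starts reproducible
x <- scale(iris[, 1:4])           # illustrative numeric data, standardised

k_values <- 1:10
sse <- sapply(k_values, function(k) {
  # tot.withinss is the sum of the per-cluster within sums of squares (sum(withinss)).
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

plot(k_values, sse, type = "b",
     xlab = "Number of clusters", ylab = "SSE", main = "Elbow Curve")
```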
Elbow Method
print("SSE curve for ideal k value")
## [1] "SSE curve for ideal k value"
plot(1:15, SSE_curve, type="b", xlab="Number of clusters", ylab="SSE", main="Elbow Curve")
kcluster<- kmeans(clustering_data_norm4, 4)
print("The size of each clusters")
## [1] "The size of each clusters"
kcluster$size
## [1] 532 2397 882 4022
kcluster$centers
## accommodates bathrooms bedrooms beds guests_included extra_people
## 1 3.003759 1.076887 1.213186 1.855263 1.689850 16.246241
## 2 3.895286 1.183805 1.656858 2.505188 2.343763 35.354610
## 3 2.919501 1.085620 1.408163 1.871864 1.487528 8.164399
## 4 2.706862 1.081500 1.298832 1.714786 1.251119 1.510194
## minimum_nights host_response_rate number_of_reviews review_scores_rating
## 1 2.062030 79.99248 89.979323 93.25000
## 2 2.415102 80.18982 9.956195 93.23614
## 3 2.798186 43.16327 6.558957 92.39633
## 4 2.560666 81.78319 7.666335 93.62522
## bathroom_luxury privacy accom baths reviews_count review_rating
## 1 2.834877 0.08995579 0.1335840 0.1346108 0.30296068 0.9156250
## 2 3.371624 0.09471770 0.1930191 0.1479757 0.03352254 0.9154518
## 3 2.729775 0.07640367 0.1279667 0.1357025 0.02208403 0.9049542
## 4 2.495843 0.07346016 0.1137908 0.1351875 0.02581258 0.9203152
## bedroom_count bed_count bathrom_lux bed_type.Airbed bed_type.Couch
## 1 0.1213186 0.05701754 0.1771798 0.003759398 0.0018796992
## 2 0.1656858 0.10034586 0.2107265 0.001251564 0.0008343763
## 3 0.1408163 0.05812425 0.1706109 0.000000000 0.0022675737
## 4 0.1298832 0.04765243 0.1559902 0.001989060 0.0014917951
## bed_type.Futon bed_type.Pull-out Sofa bed_type.Real Bed
## 1 0.011278195 0.013157895 0.9699248
## 2 0.002920317 0.005840634 0.9891531
## 3 0.004535147 0.014739229 0.9784580
## 4 0.002237693 0.014917951 0.9793635
## property_type.Apartment property_type.Bed & Breakfast property_type.Boat
## 1 0.7161654 0.12406015 0.063909774
## 2 0.7801418 0.03796412 0.063829787
## 3 0.8480726 0.02494331 0.007936508
## 4 0.8157633 0.04748881 0.033068125
## property_type.Cabin property_type.Camper/RV property_type.Chalet
## 1 0.0056390977 0.000000000 0.0000000000
## 2 0.0004171882 0.001251564 0.0000000000
## 3 0.0000000000 0.001133787 0.0000000000
## 4 0.0019890602 0.001740428 0.0002486325
## property_type.Dorm property_type.Earth House property_type.House
## 1 0.000000000 0.0000000000 0.07330827
## 2 0.000000000 0.0000000000 0.10179391
## 3 0.000000000 0.0000000000 0.10317460
## 4 0.000497265 0.0002486325 0.08378916
## property_type.Hut property_type.Loft property_type.Other
## 1 0.0000000000 0.009398496 0.007518797
## 2 0.0000000000 0.009595327 0.002503129
## 3 0.0000000000 0.012471655 0.001133787
## 4 0.0002486325 0.009448036 0.004475385
## property_type.Treehouse property_type.Villa property_type.Yurt
## 1 0.0000000000 0.0000000000 0.0000000000
## 2 0.0004171882 0.0012515645 0.0008343763
## 3 0.0000000000 0.0011337868 0.0000000000
## 4 0.0000000000 0.0009945301 0.0000000000
## room_type.Entire home/apt room_type.Private room room_type.Shared room
## 1 0.6654135 0.3345865 0.000000000
## 2 0.8694201 0.1243221 0.006257822
## 3 0.7743764 0.2176871 0.007936508
## 4 0.7916459 0.2023869 0.005967181
## neighbourhood_cleansed.Bijlmer-Centrum neighbourhood_cleansed.Bijlmer-Oost
## 1 0.001879699 0.001879699
## 2 0.001251564 0.001251564
## 3 0.000000000 0.002267574
## 4 0.004972650 0.002734958
## neighbourhood_cleansed.Bos en Lommer
## 1 0.03007519
## 2 0.04005006
## 3 0.04308390
## 4 0.04699155
## neighbourhood_cleansed.Buitenveldert - Zuidas
## 1 0.001879699
## 2 0.007926575
## 3 0.014739229
## 4 0.012680259
## neighbourhood_cleansed.Centrum-Oost neighbourhood_cleansed.Centrum-West
## 1 0.16729323 0.2951128
## 2 0.12223613 0.1989987
## 3 0.08843537 0.1519274
## 4 0.11437096 0.1636002
## neighbourhood_cleansed.De Aker - Nieuw Sloten
## 1 0.003759398
## 2 0.006257822
## 3 0.004535147
## 4 0.005221283
## neighbourhood_cleansed.De Baarsjes - Oud-West
## 1 0.1616541
## 2 0.1422612
## 3 0.1564626
## 4 0.1586275
## neighbourhood_cleansed.De Pijp - Rivierenbuurt
## 1 0.09022556
## 2 0.10972048
## 3 0.11337868
## 4 0.12307310
## neighbourhood_cleansed.Gaasperdam - Driemond
## 1 0.000000000
## 2 0.001668753
## 3 0.000000000
## 4 0.001491795
## neighbourhood_cleansed.Geuzenveld - Slotermeer
## 1 0.005639098
## 2 0.005006258
## 3 0.018140590
## 4 0.006713078
## neighbourhood_cleansed.IJburg - Zeeburgereiland
## 1 0.01315789
## 2 0.01126408
## 3 0.02040816
## 4 0.01218299
## neighbourhood_cleansed.Noord-Oost neighbourhood_cleansed.Noord-West
## 1 0.005639098 0.007518797
## 2 0.004589070 0.008343763
## 3 0.005668934 0.006802721
## 4 0.006961711 0.010442566
## neighbourhood_cleansed.Oostelijk Havengebied - Indische Buurt
## 1 0.02255639
## 2 0.03504380
## 3 0.06009070
## 4 0.04699155
## neighbourhood_cleansed.Osdorp neighbourhood_cleansed.Oud-Noord
## 1 0.005639098 0.01315789
## 2 0.004589070 0.02920317
## 3 0.004535147 0.02040816
## 4 0.005718548 0.02759821
## neighbourhood_cleansed.Oud-Oost neighbourhood_cleansed.Slotervaart
## 1 0.02067669 0.01503759
## 2 0.05047977 0.02503129
## 3 0.07482993 0.02040816
## 4 0.05271009 0.01392342
## neighbourhood_cleansed.Watergraafsmeer neighbourhood_cleansed.Westerpark
## 1 0.01127820 0.07142857
## 2 0.02503129 0.09261577
## 3 0.02494331 0.08730159
## 4 0.02262556 0.08751865
## neighbourhood_cleansed.Zuid host_response_time.1 host_response_time.2
## 1 0.05451128 0.003759398 0.007518797
## 2 0.07717981 0.001251564 0.080517313
## 3 0.08163265 0.201814059 0.000000000
## 4 0.07284933 0.000000000 0.133018399
## host_response_time.3 host_response_time.4 host_response_time.5
## 1 0.1860902 0.32706767 0.47556391
## 2 0.2027534 0.40467251 0.31080517
## 3 0.6984127 0.08390023 0.01587302
## 4 0.2076082 0.37991049 0.27946295
Adding a new column with the cluster assignment for each observation in the sample.
segment<-kcluster$cluster
clustering_data_norm5 <- cbind(clustering_data_norm4,segment)
head(clustering_data_norm5)
## accommodates bathrooms bedrooms beds guests_included extra_people
## 1 4 2 2 2 4 10
## 2 2 1 1 2 1 10
## 3 4 1 1 1 2 25
## 4 2 1 1 1 1 10
## 5 6 1 2 2 2 25
## 6 4 1 1 1 2 25
## minimum_nights host_response_rate number_of_reviews review_scores_rating
## 1 4 65 11 98.0000
## 2 3 85 108 97.0000
## 3 3 85 15 92.0000
## 4 2 85 20 97.0000
## 5 2 74 1 100.0000
## 6 2 75 0 93.3423
## bathroom_luxury privacy accom baths reviews_count review_rating
## 1 2 0.0625 0.20000000 0.250 0.037037037 0.9750000
## 2 2 0.1250 0.06666667 0.125 0.363636364 0.9625000
## 3 4 0.0625 0.20000000 0.125 0.050505051 0.9000000
## 4 2 0.0625 0.06666667 0.125 0.067340067 0.9625000
## 5 6 0.0625 0.33333333 0.125 0.003367003 1.0000000
## 6 4 0.0625 0.20000000 0.125 0.000000000 0.9167787
## bedroom_count bed_count bathrom_lux bed_type.Airbed bed_type.Couch
## 1 0.2 0.06666667 0.125 0 0
## 2 0.1 0.06666667 0.125 0 0
## 3 0.1 0.00000000 0.250 0 0
## 4 0.1 0.00000000 0.125 0 0
## 5 0.2 0.06666667 0.375 0 0
## 6 0.1 0.00000000 0.250 0 0
## bed_type.Futon bed_type.Pull-out Sofa bed_type.Real Bed
## 1 0 0 1
## 2 0 0 1
## 3 0 0 1
## 4 0 0 1
## 5 0 0 1
## 6 0 0 1
## property_type.Apartment property_type.Bed & Breakfast property_type.Boat
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## property_type.Cabin property_type.Camper/RV property_type.Chalet
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## property_type.Dorm property_type.Earth House property_type.House
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## property_type.Hut property_type.Loft property_type.Other
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## property_type.Treehouse property_type.Villa property_type.Yurt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## room_type.Entire home/apt room_type.Private room room_type.Shared room
## 1 1 0 0
## 2 0 1 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 0 1 0
## neighbourhood_cleansed.Bijlmer-Centrum neighbourhood_cleansed.Bijlmer-Oost
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## neighbourhood_cleansed.Bos en Lommer
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.Buitenveldert - Zuidas
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.Centrum-Oost neighbourhood_cleansed.Centrum-West
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 0
## 5 0 1
## 6 0 1
## neighbourhood_cleansed.De Aker - Nieuw Sloten
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.De Baarsjes - Oud-West
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.De Pijp - Rivierenbuurt
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.Gaasperdam - Driemond
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.Geuzenveld - Slotermeer
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.IJburg - Zeeburgereiland
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.Noord-Oost neighbourhood_cleansed.Noord-West
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## neighbourhood_cleansed.Oostelijk Havengebied - Indische Buurt
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
## neighbourhood_cleansed.Osdorp neighbourhood_cleansed.Oud-Noord
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## neighbourhood_cleansed.Oud-Oost neighbourhood_cleansed.Slotervaart
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## neighbourhood_cleansed.Watergraafsmeer neighbourhood_cleansed.Westerpark
## 1 0 1
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## neighbourhood_cleansed.Zuid host_response_time.1 host_response_time.2
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## host_response_time.3 host_response_time.4 host_response_time.5 segment
## 1 1 0 0 4
## 2 0 0 1 1
## 3 0 1 0 2
## 4 1 0 0 4
## 5 1 0 0 2
## 6 1 0 0 2
data_new_var <- as.data.frame(data_new_var)
segment <- data.frame(segment = segment)
Segment
airbnb_data_seg <- cbind(data_new_var,segment)
Rename the column segment to cluster
airbnb_data_seg<-rename(airbnb_data_seg, cluster = segment)
cluster1 <- subset(airbnb_data_seg, cluster == 1)
ggplot(data = airbnb_data_seg, aes(x=room_type, fill = cluster))+geom_bar(stat="count",position=position_dodge())+
facet_grid(airbnb_data_seg$cluster)+labs(x="Types of Rooms", y="Number of Rooms", title = "Distribution of various types of rooms across clusters")
Cluster 4 has the highest number of ‘Entire home/apt’ listings, followed by cluster 2. Most ‘Private room’ listings are also in cluster 4. According to the cluster centers above, only cluster 1 contains no shared rooms; in the other clusters they make up less than 1% of listings. Overall, ‘Entire home/apt’ is the most common room type, followed by ‘Private room’.
ggplot(data = airbnb_data_seg, aes(x=bedrooms, y=log(price), fill = cluster))+
geom_point(color = "plum", shape=23)+
geom_smooth(method = lm, se=FALSE)+
facet_wrap(airbnb_data_seg$cluster)+
labs(x="Number of bedrooms", y="log(Price)",
title = "Relationship b/w price and number of bedrooms")
## `geom_smooth()` using formula 'y ~ x'
As the number of bedrooms increases, log(price) tends to increase. That is, there appears to be a positive linear relationship between the number of bedrooms and the log of the room price.
ggplot(data = airbnb_data_seg, aes(x=log(price),fill = cluster))+
geom_histogram(bins=15)+
facet_grid(airbnb_data_seg$cluster)+
labs(x="log(Price)", y="Number of Rooms",
title = "Price of Rooms")
The log of the room price appears approximately normally distributed. The cheapest room is in cluster 1 and the most expensive room is in cluster 4. Overall, rooms in cluster 4 are the most expensive, followed by rooms in cluster 2. The log price in cluster 4 has the highest variance, while the log price in cluster 1 has the smallest variance.
Use R’s scale() function to scale all your column values
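As a small illustration of what scale() does by default, the sketch below compares it with the z-score formula (x - mean(x)) / sd(x) on a hypothetical vector.

```r
x <- c(2, 4, 6, 8)                      # hypothetical values

z_manual <- (x - mean(x)) / sd(x)       # z-score computed by hand
z_scale  <- as.numeric(scale(x))        # scale() centers and divides by the standard deviation

all.equal(z_manual, z_scale)
c(mean = mean(z_scale), sd = sd(z_scale))
```

For a 0/1 indicator column this has a noticeable side effect: a level that occurs only once in 7,833 rows, such as property_type.Chalet, gets a z-score of about 88.5 for that single row, which is the maximum shown in the summary that follows.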
hirerachial_data_1 <- as.data.frame(scale(clustering_data_norm4))
summary(hirerachial_data_1)
## accommodates bathrooms bedrooms beds
## Min. :-1.2032 Min. :-2.8310 Min. :-1.5980 Min. :-0.595189
## 1st Qu.:-0.6342 1st Qu.:-0.2873 1st Qu.:-0.4686 1st Qu.:-0.595189
## Median :-0.6342 Median :-0.2873 Median :-0.4686 Median :-0.595189
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.5038 3rd Qu.:-0.2873 3rd Qu.: 0.6608 3rd Qu.: 0.009747
## Max. : 7.3317 Max. :17.5185 Max. : 9.6960 Max. : 8.478852
## guests_included extra_people minimum_nights host_response_rate
## Min. :-1.4338 Min. :-0.7201 Min. :-0.7949 Min. :-5.1977
## 1st Qu.:-0.5605 1st Qu.:-0.7201 1st Qu.:-0.7949 1st Qu.:-0.1251
## Median :-0.5605 Median :-0.7201 Median :-0.2681 Median : 0.5604
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3127 3rd Qu.: 0.6019 3rd Qu.: 0.2587 3rd Qu.: 0.5604
## Max. :12.5382 Max. :11.7064 Max. :12.9018 Max. : 0.6289
## number_of_reviews review_scores_rating bathroom_luxury privacy
## Min. :-0.54296 Min. :-10.9982 Min. :-2.1029 Min. :-1.4907
## 1st Qu.:-0.50371 1st Qu.: -0.2013 1st Qu.:-0.6079 1st Qu.:-0.3464
## Median :-0.34670 Median : 0.0000 Median :-0.6079 Median :-0.3464
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04581 3rd Qu.: 0.6985 3rd Qu.: 0.8871 3rd Qu.: 0.3402
## Max. :11.11471 Max. : 0.9984 Max. : 9.8571 Max. :16.8183
## accom baths reviews_count review_rating
## Min. :-1.2032 Min. :-2.8310 Min. :-0.54296 Min. :-10.9982
## 1st Qu.:-0.6342 1st Qu.:-0.2873 1st Qu.:-0.50371 1st Qu.: -0.2013
## Median :-0.6342 Median :-0.2873 Median :-0.34670 Median : 0.0000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.5038 3rd Qu.:-0.2873 3rd Qu.: 0.04581 3rd Qu.: 0.6985
## Max. : 7.3317 Max. :17.5185 Max. :11.11471 Max. : 0.9984
## bedroom_count bed_count bathrom_lux bed_type.Airbed
## Min. :-1.5980 Min. :-0.595189 Min. :-2.1029 Min. :-0.04077
## 1st Qu.:-0.4686 1st Qu.:-0.595189 1st Qu.:-0.6079 1st Qu.:-0.04077
## Median :-0.4686 Median :-0.595189 Median :-0.6079 Median :-0.04077
## Mean : 0.0000 Mean : 0.000000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6608 3rd Qu.: 0.009747 3rd Qu.: 0.8871 3rd Qu.:-0.04077
## Max. : 9.6960 Max. : 8.478852 Max. : 9.8571 Max. :24.52472
## bed_type.Couch bed_type.Futon bed_type.Pull-out Sofa bed_type.Real Bed
## Min. :-0.0375 Min. :-0.0577 Min. :-0.1102 Min. :-7.3068
## 1st Qu.:-0.0375 1st Qu.:-0.0577 1st Qu.:-0.1102 1st Qu.: 0.1368
## Median :-0.0375 Median :-0.0577 Median :-0.1102 Median : 0.1368
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.0375 3rd Qu.:-0.0577 3rd Qu.:-0.1102 3rd Qu.: 0.1368
## Max. :26.6646 Max. :17.3272 Max. : 9.0730 Max. : 0.1368
## property_type.Apartment property_type.Bed & Breakfast property_type.Boat
## Min. :-2.0108 Min. :-0.2226 Min. :-0.2087
## 1st Qu.: 0.4973 1st Qu.:-0.2226 1st Qu.:-0.2087
## Median : 0.4973 Median :-0.2226 Median :-0.2087
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4973 3rd Qu.:-0.2226 3rd Qu.:-0.2087
## Max. : 0.4973 Max. : 4.4908 Max. : 4.7907
## property_type.Cabin property_type.Camper/RV property_type.Chalet
## Min. :-0.03917 Min. :-0.0375 Min. :-0.0113
## 1st Qu.:-0.03917 1st Qu.:-0.0375 1st Qu.:-0.0113
## Median :-0.03917 Median :-0.0375 Median :-0.0113
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.03917 3rd Qu.:-0.0375 3rd Qu.:-0.0113
## Max. :25.52776 Max. :26.6646 Max. :88.4929
## property_type.Dorm property_type.Earth House property_type.House
## Min. :-0.01598 Min. :-0.0113 Min. :-0.3159
## 1st Qu.:-0.01598 1st Qu.:-0.0113 1st Qu.:-0.3159
## Median :-0.01598 Median :-0.0113 Median :-0.3159
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.01598 3rd Qu.:-0.0113 3rd Qu.:-0.3159
## Max. :62.56996 Max. :88.4929 Max. : 3.1647
## property_type.Hut property_type.Loft property_type.Other
## Min. :-0.0113 Min. :-0.09963 Min. :-0.06096
## 1st Qu.:-0.0113 1st Qu.:-0.09963 1st Qu.:-0.06096
## Median :-0.0113 Median :-0.09963 Median :-0.06096
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.0113 3rd Qu.:-0.09963 3rd Qu.:-0.06096
## Max. :88.4929 Max. :10.03566 Max. :16.40333
## property_type.Treehouse property_type.Villa property_type.Yurt
## Min. :-0.0113 Min. :-0.03197 Min. :-0.01598
## 1st Qu.:-0.0113 1st Qu.:-0.03197 1st Qu.:-0.01598
## Median :-0.0113 Median :-0.03197 Median :-0.01598
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.0113 3rd Qu.:-0.03197 3rd Qu.:-0.01598
## Max. :88.4929 Max. :31.27299 Max. :62.56996
## room_type.Entire home/apt room_type.Private room room_type.Shared room
## Min. :-2.0312 Min. :-0.483 Min. :-0.07685
## 1st Qu.: 0.4923 1st Qu.:-0.483 1st Qu.:-0.07685
## Median : 0.4923 Median :-0.483 Median :-0.07685
## Mean : 0.0000 Mean : 0.000 Mean : 0.00000
## 3rd Qu.: 0.4923 3rd Qu.:-0.483 3rd Qu.:-0.07685
## Max. : 0.4923 Max. : 2.070 Max. :13.01003
## neighbourhood_cleansed.Bijlmer-Centrum neighbourhood_cleansed.Bijlmer-Oost
## Min. :-0.05543 Min. :-0.04663
## 1st Qu.:-0.05543 1st Qu.:-0.04663
## Median :-0.05543 Median :-0.04663
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.05543 3rd Qu.:-0.04663
## Max. :18.03700 Max. :21.44076
## neighbourhood_cleansed.Bos en Lommer
## Min. :-0.2127
## 1st Qu.:-0.2127
## Median :-0.2127
## Mean : 0.0000
## 3rd Qu.:-0.2127
## Max. : 4.7014
## neighbourhood_cleansed.Buitenveldert - Zuidas
## Min. :-0.1041
## 1st Qu.:-0.1041
## Median :-0.1041
## Mean : 0.0000
## 3rd Qu.:-0.1041
## Max. : 9.6041
## neighbourhood_cleansed.Centrum-Oost neighbourhood_cleansed.Centrum-West
## Min. :-0.3648 Min. :-0.4717
## 1st Qu.:-0.3648 1st Qu.:-0.4717
## Median :-0.3648 Median :-0.4717
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.3648 3rd Qu.:-0.4717
## Max. : 2.7410 Max. : 2.1195
## neighbourhood_cleansed.De Aker - Nieuw Sloten
## Min. :-0.07342
## 1st Qu.:-0.07342
## Median :-0.07342
## Mean : 0.00000
## 3rd Qu.:-0.07342
## Max. :13.61897
## neighbourhood_cleansed.De Baarsjes - Oud-West
## Min. :-0.4259
## 1st Qu.:-0.4259
## Median :-0.4259
## Mean : 0.0000
## 3rd Qu.:-0.4259
## Max. : 2.3474
## neighbourhood_cleansed.De Pijp - Rivierenbuurt
## Min. :-0.3616
## 1st Qu.:-0.3616
## Median :-0.3616
## Mean : 0.0000
## 3rd Qu.:-0.3616
## Max. : 2.7649
## neighbourhood_cleansed.Gaasperdam - Driemond
## Min. :-0.03575
## 1st Qu.:-0.03575
## Median :-0.03575
## Mean : 0.00000
## 3rd Qu.:-0.03575
## Max. :27.96784
## neighbourhood_cleansed.Geuzenveld - Slotermeer
## Min. :-0.08636
## 1st Qu.:-0.08636
## Median :-0.08636
## Mean : 0.00000
## 3rd Qu.:-0.08636
## Max. :11.57733
## neighbourhood_cleansed.IJburg - Zeeburgereiland
## Min. :-0.1143
## 1st Qu.:-0.1143
## Median :-0.1143
## Mean : 0.0000
## 3rd Qu.:-0.1143
## Max. : 8.7490
## neighbourhood_cleansed.Noord-Oost neighbourhood_cleansed.Noord-West
## Min. :-0.07769 Min. :-0.09631
## 1st Qu.:-0.07769 1st Qu.:-0.09631
## Median :-0.07769 Median :-0.09631
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.07769 3rd Qu.:-0.09631
## Max. :12.87006 Max. :10.38161
## neighbourhood_cleansed.Oostelijk Havengebied - Indische Buurt
## Min. :-0.2123
## 1st Qu.:-0.2123
## Median :-0.2123
## Mean : 0.0000
## 3rd Qu.:-0.2123
## Max. : 4.7087
## neighbourhood_cleansed.Osdorp neighbourhood_cleansed.Oud-Noord
## Min. :-0.07253 Min. :-0.1643
## 1st Qu.:-0.07253 1st Qu.:-0.1643
## Median :-0.07253 Median :-0.1643
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.07253 3rd Qu.:-0.1643
## Max. :13.78494 Max. : 6.0844
## neighbourhood_cleansed.Oud-Oost neighbourhood_cleansed.Slotervaart
## Min. :-0.235 Min. :-0.1359
## 1st Qu.:-0.235 1st Qu.:-0.1359
## Median :-0.235 Median :-0.1359
## Mean : 0.000 Mean : 0.0000
## 3rd Qu.:-0.235 3rd Qu.:-0.1359
## Max. : 4.255 Max. : 7.3590
## neighbourhood_cleansed.Watergraafsmeer neighbourhood_cleansed.Westerpark
## Min. :-0.1529 Min. :-0.3105
## 1st Qu.:-0.1529 1st Qu.:-0.3105
## Median :-0.1529 Median :-0.3105
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.1529 3rd Qu.:-0.3105
## Max. : 6.5387 Max. : 3.2198
## neighbourhood_cleansed.Zuid host_response_time.1 host_response_time.2
## Min. :-0.2825 Min. :-0.1547 Min. :-0.321
## 1st Qu.:-0.2825 1st Qu.:-0.1547 1st Qu.:-0.321
## Median :-0.2825 Median :-0.1547 Median :-0.321
## Mean : 0.0000 Mean : 0.0000 Mean : 0.000
## 3rd Qu.:-0.2825 3rd Qu.:-0.1547 3rd Qu.:-0.321
## Max. : 3.5393 Max. : 6.4651 Max. : 3.114
## host_response_time.3 host_response_time.4 host_response_time.5
## Min. :-0.5926 Min. :-0.7347 Min. :-0.6123
## 1st Qu.:-0.5926 1st Qu.:-0.7347 1st Qu.:-0.6123
## Median :-0.5926 Median :-0.7347 Median :-0.6123
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6873 3rd Qu.: 1.3610 3rd Qu.: 1.6330
## Max. : 1.6873 Max. : 1.3610 Max. : 1.6330
Notice that the mean of every attribute is now zero. Because the data were standardized, each attribute also has a standard deviation of one, although summary() does not print it.
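Since summary() reports only the mean and quantiles, the standard deviation can be checked directly. The following is a minimal sketch on made-up toy values (the data frame `toy` and its numbers are assumptions for illustration, not taken from the AirBnb file); the same two sapply() calls could be run on the scaled data frame.

```r
# Toy data standing in for numeric listing features (values are invented)
toy <- data.frame(beds     = c(1, 2, 2, 4, 6),
                  bedrooms = c(1, 1, 2, 3, 4))

# scale() centers each column to mean 0 and divides by its sample sd
toy_scaled <- as.data.frame(scale(toy))

# Column means should be 0 and standard deviations 1, up to rounding error
col_means <- sapply(toy_scaled, mean)
col_sds   <- sapply(toy_scaled, sd)
print(round(col_means, 10))
print(col_sds)
```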
All the values are now continuous and numeric, so we use the Euclidean distance.
hirerachial_data_2 <- dist(hirerachial_data_1, method = 'euclidean')
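To make the distance step concrete, here is a small sketch on three invented points (the matrix `x` and its row names are assumptions for illustration). It shows that dist() returns the pairwise Euclidean distances and compares one of them with the value computed by hand.

```r
# Three invented 2-D points; row names are only labels for readability
x <- rbind(a = c(0, 0),
           b = c(3, 4),
           c = c(6, 8))

# dist() returns the lower triangle of the pairwise distance matrix
d <- dist(x, method = "euclidean")
d_mat <- as.matrix(d)
print(d_mat)

# Euclidean distance between a and b computed by hand: sqrt(3^2 + 4^2)
manual_ab <- sqrt(sum((x["a", ] - x["b", ])^2))
print(manual_ab)

# A dist object holds n * (n - 1) / 2 values for n rows
n_pairs <- length(d)
print(n_pairs)
```

For n rows, dist() stores n(n - 1)/2 distances, so memory use grows quadratically. With the 7833 listings clustered here, that is roughly 30.7 million values.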
Applying Linkage Method
hirerachial_data_3 <- hclust(hirerachial_data_2, method = "ward.D2")
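The choice of "ward.D2" is taken as given here. One possible sanity check, sketched below on simulated data (the matrix `toy`, the seed, and the list of methods are assumptions, not part of the original analysis), is to compare linkage methods by their cophenetic correlation, which measures how faithfully a dendrogram preserves the original pairwise distances.

```r
# Simulated data: two loose groups of 20 points each (purely illustrative)
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
d <- dist(toy)

# Fit one tree per linkage method and correlate its cophenetic
# distances with the original distances
methods <- c("ward.D2", "complete", "average", "single")
coph <- sapply(methods, function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})
print(round(coph, 3))
```

A higher value means the tree distorts the distances less. It is only one criterion, and Ward linkage is often preferred for its compact, similarly sized clusters even when another method scores slightly higher.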
Plot the hierarchical clustering
plot(hirerachial_data_3, hang=-1, cex=0.7)
Set the K value to 4 (clusters) and plot
To see the clusters on the dendrogram, use R’s rect.hclust() function to superimpose a rectangle around each cluster and the abline() function to draw the cut line, as shown in the following code:
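k_hirerachina_data_3 <- cutree(hirerachial_data_3, k = 4)
plot(hirerachial_data_3)
rect.hclust(hirerachial_data_3 , k = 4, border = 2:6)
# Draw the cut line at a height that actually yields 4 clusters:
# midway between the 3rd and 4th largest merge heights
merge_heights <- sort(hirerachial_data_3$height, decreasing = TRUE)
abline(h = mean(merge_heights[3:4]), col = 'red')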
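The number of clusters k and the cut height h are two different things: the line drawn by abline() separates the tree into k clusters only if its height lies between the merge that leaves k clusters and the merge that leaves k - 1. A minimal sketch on simulated data (the objects `toy`, `hc`, and `cut_h` are hypothetical names introduced for illustration) shows how such a height can be derived and that cutting by k or by that height gives the same assignment.

```r
# Simulated, standardized data: 20 observations, 3 features (illustrative)
set.seed(1)
toy <- scale(matrix(rnorm(60), ncol = 3))
hc  <- hclust(dist(toy), method = "ward.D2")

k <- 4
# Merge heights from largest to smallest; the (k-1)-th largest merge
# reduces k clusters to k-1, so a cut just below it leaves k clusters
heights_desc <- sort(hc$height, decreasing = TRUE)
cut_h <- mean(heights_desc[(k - 1):k])

by_k <- cutree(hc, k = k)      # cut by number of clusters
by_h <- cutree(hc, h = cut_h)  # cut by height

# Same partition either way
print(table(by_k, by_h))

# The line now passes through exactly k vertical branches
plot(hc, hang = -1)
rect.hclust(hc, k = k, border = 2:5)
abline(h = cut_h, col = "red")
```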
Now we can see the four clusters enclosed in four differently colored boxes. We can also use the color_branches() function from the dendextend library to visualize the tree with differently colored branches.
suppressPackageStartupMessages(library(dendextend))
avg_dend_obj <- as.dendrogram(hirerachial_data_3)
avg_col_dend <- color_branches(avg_dend_obj, k = 4)
plot(avg_col_dend)
Next, we append the cluster assignments to the data frame used for clustering, in a column named cluster, with mutate() from the dplyr package, and count how many observations were assigned to each cluster with the count() function.
suppressPackageStartupMessages(library(dplyr))
hirerachial_c1 <- mutate(hirerachial_data_1, cluster = k_hirerachina_data_3)
count(hirerachial_c1,cluster)
## cluster n
## 1 1 5878
## 2 2 141
## 3 3 570
## 4 4 1244
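The counts show that the clusters are far from balanced: cluster 1 holds 5878 of the 7833 listings, about 75%. A natural follow-up is to profile each cluster by its feature means. The sketch below does this in base R on simulated data (the data frame `toy`, its two columns, and k = 2 are assumptions chosen for illustration), attaching the labels to the unscaled values so the means stay in their original units.

```r
# Simulated listings: a "small" and a "large" group (values are invented)
set.seed(7)
toy <- data.frame(
  beds     = c(rnorm(10, mean = 1, sd = 0.2), rnorm(10, mean = 4, sd = 0.3)),
  bedrooms = c(rnorm(10, mean = 1, sd = 0.2), rnorm(10, mean = 3, sd = 0.3))
)

# Cluster on the standardized copy, as in the analysis above
toy_scaled <- as.data.frame(scale(toy))
hc <- hclust(dist(toy_scaled), method = "ward.D2")

# Attach the labels to the unscaled data for interpretable summaries
toy$cluster <- cutree(hc, k = 2)

# Cluster sizes and per-cluster feature means
sizes   <- table(toy$cluster)
profile <- aggregate(cbind(beds, bedrooms) ~ cluster, data = toy, FUN = mean)
print(sizes)
print(profile)
```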
It is common to plot two features against each other, colored by cluster, to see how the clusters differ and to extract further insights from the data. Here beds is plotted against bedrooms:
suppressPackageStartupMessages(library(ggplot2))
ggplot(hirerachial_c1, aes(x=beds, y = bedrooms, color = factor(cluster))) + geom_point()